Instructions and Expectations

Background

The 2012 presidential election did not come as a surprise. Some, most notably Nate Silver, correctly predicted the outcome, and many speculated about his approach.

Despite the success in 2012, the 2016 presidential election came as a big surprise to many, and it was a clear example that even state-of-the-art methods can fail us. Predicting voter behavior is complicated for many reasons, despite the tremendous effort put into collecting, analyzing, and understanding the many available datasets. For our final project, we will analyze the 2016 presidential election dataset.

Answer the following questions in one paragraph for each.

  1. What makes voter behavior prediction (and thus election forecasting) a hard problem? Voter behavior prediction is difficult because there are many unknown sources of error embedded in the surveys released to the public in advance of the election. One of the most prominent issues is non-response bias: the individuals who actually respond to a survey may not represent the distribution of candidate preferences across the population. For example, if Hillary supporters are the primary individuals responding to a survey, the results may suggest a very strong chance of Hillary winning the election when in actuality there are many Trump supporters who simply do not care to respond to online surveys. Efforts can be made to minimize this unpredictability through improved distribution of the survey across a variety of channels and by accounting for non-response bias in forecasting formulas, but it remains difficult to quantify. Unpredictability in survey results is further exacerbated if survey respondents never actually make their way to the polls while non-respondents do turn out to vote on Election Day. Furthermore, even if poll responses are fairly uniform, voters can change their minds at a moment's notice for uncontrollable reasons, which makes forecasts very susceptible to error.

  2. What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions? Nate Silver uses a hierarchical modelling approach that generates a time series of state level voter preferences that accounts for key variables in state-level voting population (wealth, race, etc.). Incorporating temporally variable shifts that favor one candidate over another such as a drop in unemployment at the state level or an increase in national taxes can further improve the estimation of voter behavior predictions over time. Silver’s application of hierarchical modelling is particularly powerful for accounting for how state and national polls can influence each other, especially if some state polls occur before others. The hierarchical modelling framework allows for information to move throughout the model over time to inform new predictions. Silver’s approach is also particularly advanced because he calculates the full range of probabilities for a candidate winning at the state level on a given day rather than just the maximum probability. Combining this with the hierarchical model and simulating predictions forward with weighting for the probability that the starting point was correct allows for particularly high accuracy. At the end of the day, the wealth of polling data leading up to the 2012 election also was essential to the accuracy of Silver’s prediction strategy.

  3. What went wrong in 2016? What do you think should be done to make future predictions better? A key issue in the 2016 election predictions seems to be the presence of systematic polling error. Errors in individual polls compounded when aggregated to the state level, and misrepresented predicted outcomes to an even greater extent at the national level. Polling methods (live versus recorded voice surveys) showed differing degrees of support for candidates. In the case of the 2016 election, many Trump supporters may have been less willing to voice their preference in a poll, especially to a live voice. This variation in respondent behavior could lead to misleading poll data, which caused Nate Silver's predictions to suffer. Furthermore, the lackluster turnout for both Democratic poll respondents and voters on Election Day, particularly in the Midwest region, also led to skewed predictions. In order to improve future election predictions, forecasting methods will need to rely on more robust poll data, similar to what was available leading up to the 2012 elections, to provide an accurate basis for predictions. With sparse data that also bears an underlying bias, pollsters will struggle to make accurate predictions. It is unclear if respondents will take polls more or less seriously in the aftermath of the 2016 election, and pollsters will also need to account for an imbalance in the behavioral shifts of poll respondents with different preferences. Future polling results may also gain reliability by disclosing greater uncertainty in their reports. While this seems counterintuitive for polls trying to claim the highest possible accuracy, acknowledging the inherent uncertainty could lead polling results to have a less severe impact on voter behavior.

Data

library(dplyr)  # provides %>% and as.tbl (superseded by as_tibble in newer dplyr)
election.raw = read.csv("data/election/election.csv") %>% as.tbl
census_meta = read.csv("data/census/metadata.csv", sep = ";") %>% as.tbl
census = read.csv("data/census/census.csv") %>% as.tbl

Election data

The meaning of each column in election.raw is clear except fips. The acronym is short for Federal Information Processing Standard.

In our dataset, fips values denote the area (US, state, or county) that each row of data represents. For example, a fips value of 6037 denotes Los Angeles County.

county fips candidate state votes
Los Angeles County 6037 Hillary Clinton CA 2464364
Los Angeles County 6037 Donald Trump CA 769743
Los Angeles County 6037 Gary Johnson CA 88968
Los Angeles County 6037 Jill Stein CA 76465
Los Angeles County 6037 Gloria La Riva CA 21993

Some rows in election.raw are summary rows, and these rows have a county value of NA. There are two kinds of summary rows:

  • Federal-level summary rows have a fips value of US.
  • State-level summary rows have the respective state name as the fips value.
  1. Report the dimension of election.raw after removing rows with fips=2000. Provide a reason for excluding them. Please make sure to use the same name election.raw before and after removing those observations.

After removing rows with fips = 2000, the election.raw table has 18345 observations of 5 variables. Alaska has a fips value of 2000, so the rows where fips = 2000 are state-level summaries of Alaska's election results. However, those state-level summary rows for Alaska are already present when we read the data, so keeping the fips = 2000 rows would duplicate them.
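The removal step can be sketched as follows on toy rows; the LA County labels echo the tables above, but the Alaska rows and all vote counts here are invented for illustration.

```r
library(dplyr)
# Toy stand-in for election.raw (real data is read from data/election/election.csv);
# the Alaska rows and vote counts below are made up
election.raw <- tibble(
  county    = c(NA, NA, "Los Angeles County"),
  fips      = c("2000", "AK", "6037"),
  candidate = c("Donald Trump", "Donald Trump", "Hillary Clinton"),
  state     = c("AK", "AK", "CA"),
  votes     = c(100, 100, 2464364)
)
# Drop the fips = 2000 rows: their totals duplicate the Alaska state-level
# summary rows, so keeping both would double-count Alaska
election.raw <- election.raw %>% filter(fips != "2000")
dim(election.raw)  # on the real data this reports 18345 rows, 5 columns
```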

Census data

The following are the first few rows of the census data:

State County TotalPop Men Women Hispanic White Black Native Asian Pacific Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
Alabama Autauga 1948 940 1008 0.9 87.4 7.7 0.3 0.6 0.0 1503 61838 11900 25713 4548 8.1 8.4 34.7 17.0 21.3 11.9 15.2 90.2 4.8 0 0.5 2.3 2.1 25.0 943 77.1 18.3 4.6 0 5.4
Alabama Autauga 2156 1059 1097 0.8 40.4 53.3 0.0 2.3 0.0 1662 32303 13538 18021 2474 25.5 40.3 22.3 24.7 21.5 9.4 22.0 86.3 13.1 0 0.0 0.7 0.0 23.4 753 77.0 16.9 6.1 0 13.3
Alabama Autauga 2968 1364 1604 0.0 74.5 18.6 0.5 1.4 0.3 2335 44922 5629 20689 2817 12.7 19.7 31.4 24.9 22.1 9.2 12.4 94.8 2.8 0 0.0 0.0 2.5 19.6 1373 64.1 23.6 12.3 0 6.2
Alabama Autauga 4423 2172 2251 10.5 82.8 3.7 1.6 0.0 0.0 3306 54329 7003 24125 2870 2.1 1.6 27.0 20.8 27.0 8.7 16.4 86.6 9.1 0 0.0 2.6 1.6 25.3 1782 75.7 21.2 3.1 0 10.8
Alabama Autauga 10763 4922 5841 0.7 68.5 24.8 0.0 3.8 0.0 7666 51965 6935 27526 2813 11.4 17.5 49.6 14.2 18.2 2.1 15.8 88.0 10.5 0 0.0 0.6 0.9 24.8 5037 67.1 27.6 5.3 0 4.2
Alabama Autauga 3851 1787 2064 13.1 72.9 11.9 0.0 0.0 0.0 2642 63092 9585 30480 7550 14.4 21.9 24.2 17.5 35.4 7.9 14.9 82.7 6.9 0 0.0 6.0 4.5 19.8 1560 79.4 14.7 5.8 0 10.9

Census data: column metadata

Column information is given in the metadata file.

Data wrangling

  1. Move summary rows from election.raw data into federal- or state-level summary files: i.e.,

    • Federal-level summary rows go into election_federal.

    • State-level summary rows go into election_state.

    • Only county-level data is to remain in election.

  2. How many named presidential candidates were there in the 2016 election? Draw a bar chart of all votes received by each candidate. You can split this into multiple plots or may prefer to plot the results on the log scale. Either way, the results should be clear and legible!

The 2016 election had a total of 31 candidates with a 32nd category for additional candidates that collected a total of 28,863 votes across the US.
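A minimal base-R version of such a chart; the vote totals below are approximate national figures for the four largest vote-getters, used only for illustration.

```r
# Approximate national vote totals (illustrative; the precise counts
# come from the federal-level summary rows in election_federal)
votes <- c("Hillary Clinton" = 65853514, "Donald Trump" = 62984828,
           "Gary Johnson"    = 4489341,  "Jill Stein"   = 1457218)
op <- par(mar = c(4, 10, 1, 1))     # widen the left margin for candidate names
barplot(sort(votes), horiz = TRUE, las = 1,
        log = "x", xlab = "Votes (log scale)")
par(op)
```

The log scale keeps minor candidates legible next to the two major-party totals, which are an order of magnitude larger.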

  1. Create variables county_winner and state_winner by taking the candidate with the highest proportion of votes. Hint: to create county_winner, start with election, group by fips, compute total votes, and pct = votes/total. Then choose the highest row using top_n (variable state_winner is similar).
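The hint can be sketched as follows; the LA County counts are taken from the table above, while the second county's fips and counts are invented.

```r
library(dplyr)
# Toy county-level rows (second county's values are made up)
election <- tibble(
  fips      = c("6037", "6037", "1001", "1001"),
  candidate = c("Hillary Clinton", "Donald Trump",
                "Hillary Clinton", "Donald Trump"),
  votes     = c(2464364, 769743, 6000, 18000)
)
# Per the hint: group by fips, compute total votes and pct, keep the top row
county_winner <- election %>%
  group_by(fips) %>%
  mutate(total = sum(votes), pct = votes / total) %>%
  top_n(1, pct)
county_winner$candidate  # winner of each county
```

state_winner follows the same pattern with state-level rows grouped by state.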

Visualization

2016 Election: State Winner Map

## Warning: Column `fips` joining character vector and factor, coercing into
## character vector

2016 Election: County Winner Map

## Warning: Column `fips` joining factors with different levels, coercing to
## character vector

  1. Create a visualization of your choice using census data. Many exit polls noted that demographics played a big role in the election. Use this Washington Post article and this R graph gallery for ideas and inspiration.

Minority Composition of Voters by State

State Level Winner Shaded by State Population Minority Fraction

  1. The census data contains high resolution information (more fine-grained than county-level).
    In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. Create the following variables:

    • Clean census data census.del: start with census, filter out any rows with missing values, convert {Men, Employed, Citizen} attributes to percentages (meta data seems to be inaccurate), compute Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, remove these variables after creating Minority, remove {Walk, PublicWork, Construction}.
      Many columns are related and, when a set of columns sums to 100%, one of them is redundant and can be deleted. E.g., Men and Women together comprise 100% of TotalPop, so any two of the three counts determine the third, and we would choose one to delete.

    • Sub-county census data, census.subct: start with census.del from above, group_by() two attributes {State, County}, use add_tally() to compute CountyTotal. Also, compute the weight by TotalPop/CountyTotal.

    • County census data, census.ct: start with census.subct, use summarize_at() to compute the weighted sum.
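The three bullets above can be sketched end-to-end on a single toy county; all of the numbers below are invented, and only one attribute (Poverty) stands in for the full set.

```r
library(dplyr)
# Two sub-county rows for one toy county (all numbers made up)
census.del <- tibble(
  State = "Alabama", County = "Autauga",
  TotalPop = c(1000, 3000), Poverty = c(10, 20)
)
census.subct <- census.del %>%
  group_by(State, County) %>%
  add_tally(wt = TotalPop, name = "CountyTotal") %>%  # county population total
  mutate(weight = TotalPop / CountyTotal)             # sub-county weights
census.ct <- census.subct %>%
  summarize_at(vars(Poverty), ~ sum(. * weight))      # TotalPop-weighted average
census.ct$Poverty  # (1000*10 + 3000*20) / 4000 = 17.5
```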

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
## # Before:
## funs(name = f(.)
## 
## # After: 
## list(name = ~f(.))
## This warning is displayed once per session.
Print the first few rows of census.ct:
Start of Census Data
State County CountyPop Men White Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority weight
Alabama Autauga 6486.404 0.4843266 75.78823 0.7374912 51696.29 7771.009 24974.50 3433.674 12.91231 18.70758 32.79097 17.17044 24.28243 17.15713 87.50624 8.781235 0.0952590 1.3059687 1.8356531 26.50016 0.4343637 73.73649 5.433254 0.0000000 7.733726 22.53687 1
Alabama Baldwin 7698.957 0.4884866 83.10262 0.7569406 51074.36 8745.050 27316.84 3803.718 13.42423 19.48431 32.72994 17.95092 27.10439 11.32186 84.59861 8.959078 0.1266209 1.4438000 3.8504774 26.32218 0.4405113 81.28266 5.909353 0.3633269 7.589820 15.21426 1
Alabama Barbour 3325.195 0.5382816 46.23159 0.7691222 32959.30 6031.065 16824.22 2430.189 26.50563 43.55962 26.12404 16.46343 23.27878 23.31741 83.33021 11.056609 0.4954032 1.6217251 1.5019456 24.51828 0.3192113 71.59426 7.149837 0.0897742 17.525557 51.94382 1
Alabama Bibb 6380.718 0.5341090 74.49989 0.7739781 38886.63 5662.358 18430.99 3073.599 16.60375 27.19708 21.59010 17.95545 17.46731 23.74415 83.43488 13.153641 0.5031366 1.5620952 0.7314679 28.71439 0.3669262 76.74385 6.637936 0.3941515 8.163104 24.16597 1
Alabama Blount 7018.573 0.4940565 87.85385 0.7337550 46237.97 8695.786 20532.27 2052.055 16.72152 26.85738 28.52930 13.94252 23.83692 20.10413 84.85031 11.279222 0.3626321 0.4199411 2.2654133 34.84489 0.3844914 81.82671 4.228716 0.3564928 7.699640 10.59474 1
Alabama Bullock 4263.211 0.5300618 22.19918 0.7545420 33292.69 9000.345 17579.57 3110.645 24.50260 37.29116 19.55253 14.92420 20.17051 25.73547 74.77277 14.839127 0.7732160 1.8238247 3.0998783 28.63106 0.3619592 79.09065 5.273684 0.0000000 17.890026 76.53587 1
  1. If you were physically located in the United States on election day for the 2016 presidential election, what state and county were you in? Compare and contrast these county results, demographic information, etc., against the state it is located in. If you were not in the United States on election day, select a county that appears to stand apart from the ones surrounding it. Do you find anything unusual or surprising? If not, what do you hypothesise might be the reason for the county and state results?

Dimensionality reduction

  1. Run PCA for both county & sub-county level data. Save the first two principal components PC1 and PC2 into a two-column data frame, called ct.pc and subct.pc, for county and sub-county respectively. Discuss whether you chose to center and scale the features before running PCA and the reasons for your choice. What are the three features with the largest absolute values of the first principal component? Which features have opposite signs and what does that mean about the correlation between these features?

For this PCA at both the subcounty and county levels, we opted to center and scale the features before running PCA because the magnitudes of the different covariates used to generate the principal components vary considerably. For example, the TotalPop value is inherently on a different order of magnitude from the documented percentage values. Without scaling, most of the principal components would be driven by the TotalPop variable, which has the highest mean and variance of all the variables. We also centered the data so that the effects of different principal components could be compared from a common mean of zero.
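A minimal illustration of why scaling matters, using base R's prcomp on two toy features whose scales are made up to mimic a count column next to a percentage column:

```r
# Two toy features on wildly different scales (values invented)
set.seed(1)
toy <- data.frame(TotalPop = rnorm(100, 50000, 20000),  # count-scale
                  Poverty  = rnorm(100, 15, 5))         # percentage-scale
pr <- prcomp(toy, center = TRUE, scale. = TRUE)  # standardize before rotating
# After scaling, the total variance equals the number of features (here 2),
# so no single large-magnitude column can dominate the loadings
pr$sdev^2 / sum(pr$sdev^2)   # proportion of variance per PC
```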

kable(subct.pc[order(abs(subct.pr$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "Subcounty PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Subcounty PCA Loadings
[PC1 and PC2 loading values ordered by |PC1|; the feature-name row labels were lost in rendering. The labeled loading tables below carry the same information.]
kable(subct.pc[order(abs(subct.pr$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "Subcounty PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
Subcounty PCA Loadings
[PC1 and PC2 loading values ordered by |PC2|; the feature-name row labels were lost in rendering. The labeled loading tables below carry the same information.]
kable(ct.pc[order(abs(ct.pr$rotation[,1]), decreasing=TRUE), ], col.names = c("PC1 Ordered", "PC2"), caption = "County PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
County PCA Loadings
[PC1 and PC2 loading values ordered by |PC1|; the feature-name row labels were lost in rendering. The labeled loading tables below carry the same information.]
kable(ct.pc[order(abs(ct.pr$rotation[,2]), decreasing=TRUE), ], col.names = c("PC1", "PC2 Ordered"), caption = "County PCA Loadings")  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE) %>% scroll_box(width = "100%")
County PCA Loadings
[PC1 and PC2 loading values ordered by |PC2|; the feature-name row labels were lost in rendering. The labeled loading tables below carry the same information.]

After running the PCA, the three features that have the largest magnitude for PC1 at the subcounty level are IncomePerCap, Professional, and Poverty covariates. The top three covariates with the largest magnitude of PC1 at the county level are IncomePerCap, ChildPoverty, and Poverty.

County & Subcounty PC1 Top 3 Absolute Loadings
PC1 Feature Administrative Level
0.3181199 IncomePerCap Subcounty
0.3064366 Professional Subcounty
0.3046886 Poverty Subcounty
0.3506105 IncomePerCap County
0.3432571 ChildPoverty County
0.3420707 Poverty County
There are many instances in which features used in the PCA have opposite signs for PC1, at both the subcounty and county levels, as highlighted in the two tables below. Features with the same sign tend to be positively correlated, while features with opposite signs tend to be negatively correlated. This also makes logical sense for these features. For example, at the county level, Income, IncomeErr, IncomePerCap, and IncomePerCapErr all have loadings of the same sign, suggesting these features are highly correlated, as we would expect. Similarly, the Poverty, ChildPoverty, Minority, and Unemployment features all have the opposite sign and similar magnitudes, suggesting correlation among these features, as we unfortunately might expect.
Subcounty Comparison of Signs for Loadings of Different Features
PC1 PC2 Feature
-0.0324 -0.0324 SubcountyPop
-0.0173 -0.0173 Men
-0.2404 -0.2404 White
-0.1608 -0.1608 Citizen
-0.3025 -0.3025 Income
-0.1989 -0.1989 IncomeErr
-0.3181 -0.3181 IncomePerCap
-0.2123 -0.2123 IncomePerCapErr
0.3047 0.3047 Poverty
0.2979 0.2979 ChildPoverty
-0.3064 -0.3064 Professional
0.2688 0.2688 Service
0.0138 0.0138 Office
0.2068 0.2068 Production
-0.0789 -0.0789 Drive
0.1626 0.1626 Carpool
0.0573 0.0573 Transit
0.0451 0.0451 OtherTransp
-0.1730 -0.1730 WorkAtHome
-0.0100 -0.0100 MeanCommute
-0.2212 -0.2212 Employed
0.0420 0.0420 PrivateWork
-0.0697 -0.0697 SelfEmployed
-0.0152 -0.0152 FamilyWork
0.2528 0.2528 Unemployment
0.2420 0.2420 Minority
0.0215 0.0215 CountyTotal
0.0119 0.0119 weight
County Comparison of Signs for Loadings of Different Features
PC1 PC2 Feature
0.0243 -0.3428 CountyPop
-0.0076 0.1506 Men
-0.2242 0.1262 White
-0.0064 0.1631 Citizen
-0.3181 -0.2069 Income
-0.1679 -0.2558 IncomeErr
-0.3506 -0.1211 IncomePerCap
-0.1937 -0.1307 IncomePerCapErr
0.3421 0.0267 Poverty
0.3433 0.0238 ChildPoverty
-0.2497 -0.0126 Professional
0.1817 0.0259 Service
0.0174 -0.2838 Office
0.1184 -0.0090 Production
0.0956 -0.2167 Drive
0.0766 0.0488 Carpool
-0.0698 -0.1418 Transit
0.0097 0.0425 OtherTransp
-0.1770 0.3169 WorkAtHome
0.0606 -0.2349 MeanCommute
-0.3277 -0.0314 Employed
-0.0540 -0.3435 PrivateWork
-0.1008 0.3928 SelfEmployed
-0.0505 0.2643 FamilyWork
0.2914 -0.0938 Unemployment
0.2276 -0.1251 Minority
0.0054 0.0106 weight
  1. Determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Plot the proportion of variance explained (PVE) and cumulative PVE for both county and sub-county analyses.

At the county level, at least 14 principal components are required to capture 90% of the variance. At the subcounty level, 17 principal components are needed to represent 90% of the variance in the data.
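The PVE bookkeeping behind these counts can be sketched as follows on toy data; the real analysis runs the same computation on the county and sub-county PCA fits.

```r
# Minimal PVE / cumulative-PVE computation for a prcomp fit (toy data)
set.seed(2)
X <- matrix(rnorm(200 * 10), nrow = 200)
pr <- prcomp(X, center = TRUE, scale. = TRUE)
pve     <- pr$sdev^2 / sum(pr$sdev^2)   # proportion of variance explained
cum_pve <- cumsum(pve)                  # cumulative PVE
n90 <- which(cum_pve >= 0.9)[1]         # minimum PCs for 90% of the variance
# plot(pve, type = "b"); plot(cum_pve, type = "b")  # the two required plots
```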

Clustering

  1. With census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of ct.pc as inputs instead of the original features. Compare and contrast the results. For both approaches investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.

San Mateo County is found in clusters 1 and 1 for the hierarchical clustering using the census data and the first five PCs, respectively. Based on the two visualizations of each clustering approach, one can start to evaluate which clustering approach is qualitatively superior. The San Mateo cluster generated using the census data (cluster 2, orange triangle) exhibits considerable overlap with other clusters, which is indicative of poor clustering. This approach also included 379 counties in the San Mateo cluster. By comparison, the clustering based on the principal components seems to do a better job of isolating the San Mateo County cluster (cluster 5, green diamond), and this approach included only 112 counties in the cluster with San Mateo. This decrease in cluster size likely helped reduce within-cluster variation.

This qualitative analysis is further supported by a quantitative measure of within-cluster variation, which is smaller for the PC-based clustering than for the raw-census clustering (3.8880276 × 10^8 for the raw census data versus 2.8345758 × 10^8 for the PC-based clustering).

Based on these results, the hierarchical clustering approach that utilized the first five PC generated from PCA performed on the census data seemed to produce a more appropriate cluster for San Mateo County as opposed to hierarchical clustering based on the raw census data.
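The clustering pipeline described above can be sketched as follows; a random matrix stands in for the census.ct features, and the cluster count follows the question.

```r
# Hierarchical clustering with complete linkage, on raw features and on PCs
set.seed(3)
X <- scale(matrix(rnorm(60 * 6), nrow = 60))   # toy stand-in for census.ct features
hc_raw <- hclust(dist(X), method = "complete")
cl_raw <- cutree(hc_raw, k = 10)               # partition into 10 clusters
pc5 <- prcomp(X)$x[, 1:5]                      # first five principal components
hc_pc <- hclust(dist(pc5), method = "complete")
cl_pc <- cutree(hc_pc, k = 10)                 # same cut on the PC scores
```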

Classification

In order to train classification models, we need to combine county_winner and census.ct data. This seemingly straightforward task is harder than it sounds. The following code makes the necessary changes to merge them into election.cl for classification.

Using the following code, partition data into 80% training and 20% testing:

Using the following code, define 10 cross-validation folds:

Using the following error rate function:
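The error-rate function itself is not reproduced in this rendering; a typical helper of this kind (a hypothetical reconstruction, including the name calc_error_rate) would be:

```r
# Hypothetical reconstruction of the referenced error-rate helper:
# the fraction of predictions that disagree with the true labels
calc_error_rate <- function(predicted.value, true.value) {
  mean(true.value != predicted.value)
}
calc_error_rate(c("a", "b", "b"), c("a", "a", "b"))  # 1/3
```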

  1. Decision tree: train a decision tree by cv.tree(). Prune the resulting tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to a records variable. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?)

The first tree below is grown using all of the predictors in the training portion of the election data set to predict the winning candidate.

The second tree below shows the pruned version of the original tree, produced after using a 10-fold cross-validation approach to determine the tree size that minimizes classification error. The tree has been pruned to the size selected by that tuning parameter.
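A minimal sketch of that workflow on toy two-class data; the real fit uses the training split and the predefined folds (passed via cv.tree's rand argument), and everything below is illustrative.

```r
library(tree)
# Toy two-class data with a signal in x1 (all values invented)
set.seed(4)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- factor(ifelse(df$x1 + rnorm(200, sd = 0.5) > 0, "A", "B"))
fit <- tree(y ~ ., data = df)
cv <- cv.tree(fit, FUN = prune.misclass)   # CV over candidate tree sizes
best <- cv$size[which.min(cv$dev)]         # size minimizing misclassification
pruned <- prune.misclass(fit, best = best)
# plot(fit); text(fit)      # full tree
# plot(pruned); text(pruned)  # pruned tree
```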

  1. Run a logistic regression to predict the winning candidate in each county. Save training and test errors to the records variable. What are the significant variables? Are these consistent with what you observed in the decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables. Did the results for your particular county (from question 13) match the predicted results?
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

The most significant variables in predicting the winning candidate with a logistic model are Citizen, Professional, Service, Production, Drive, Employed, PrivateWork, and Unemployment, each with p ~ 0. Other variables that were quite significant include White, IncomePerCap, Carpool, and FamilyWork with p < 0.001, as well as Office with p < 0.01 and Income with p < 0.05. It seems that many of the variables are significant, which makes sense given that we have perfect separation in our result. In the decision tree, the main splitting variable was Transit, where the majority of nodes predicting Trump fell to the left side and the majority of nodes predicting Clinton fell to the right side. Transit in the logistic model, on the other hand, was not a significant predictor. White was used in the second tier of the decision tree and was a relatively strong predictor in the logistic model. CountyPop was used to make the second split on the right side of the tree but was not a significant predictor in the logistic model. This may be because, as we can see in the tree, only very large counties predict Hillary, while there was more variation among medium to small counties. Professional and Production were two variables that were consistent between the decision tree and logistic regression analyses. The Production variable, however, appeared at an end node of the decision tree whose two branches both predicted Trump, and therefore was not very helpful at distinguishing between the candidates.
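For reference, the kind of logistic fit described above can be sketched with base R's glm; the data, true coefficients, and variable subset below are invented for illustration.

```r
# Toy logistic regression with two census-style predictors (values made up)
set.seed(5)
df <- data.frame(Unemployment = rnorm(300, 8, 3), White = rnorm(300, 70, 15))
logit <- -2 + 0.4 * df$Unemployment - 0.02 * df$White
df$winner <- rbinom(300, 1, plogis(logit))   # 1 = toy stand-in for one candidate
fit <- glm(winner ~ Unemployment + White, data = df, family = binomial)
coef(summary(fit))   # estimates, std. errors, z values, p-values
# exp(coef(fit)["Unemployment"]) is the odds-ratio for a one-unit change
```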


  1. You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression under the LASSO penalty. Reminder: set alpha=1 to run LASSO. What are the non-zero coefficients in the LASSO regression for the optimal value of \(\lambda\)? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.

Estimated Coefficient Values from LASSO Regression
Coefficient Estimate
(Intercept) -2.717
CountyPop 0.000
Men 0.000
White -0.036
Citizen 3.738
Income 0.000
IncomeErr 0.000
IncomePerCap 0.000
IncomePerCapErr 0.000
Poverty 0.070
ChildPoverty -0.012
Professional 0.108
Service 0.117
Office 0.029
Production 0.046
Drive -0.056
Carpool -0.046
Transit 0.000
OtherTransp 0.019
WorkAtHome -0.027

The non-zero coefficients for the LASSO model are White, Citizen, Poverty, ChildPoverty, Professional, Service, Office, Production, Drive, Carpool, OtherTransp, and WorkAtHome.
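A sketch of the cv.glmnet workflow on synthetic data; the matrix, response, and column names below are made up, and the real fit uses the census predictors and winner labels.

```r
library(glmnet)
# Synthetic design matrix and 0/1 response (all values invented)
set.seed(6)
X <- matrix(rnorm(200 * 10), nrow = 200,
            dimnames = list(NULL, paste0("V", 1:10)))
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))
# K-fold CV over lambda with the LASSO penalty (alpha = 1)
cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")   # sparse vector; zeros are dropped predictors
```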

  1. Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
Summary of Training and Test Errors Associated with Different Classification Models
Training Error Test Error
tree 0.0883550 0.0975610
logistic 0.0663681 0.0780488
lasso 0.0728827 0.0943089
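Since the ROC curves above are computed from test-set predictions, here is a dependency-free sketch of how such a curve can be produced by sweeping a threshold over predicted scores; the helper name roc_points and the toy scores are illustrative.

```r
# Hand-rolled ROC computation: sweep thresholds over predicted scores
roc_points <- function(score, truth) {
  th <- sort(unique(score), decreasing = TRUE)
  t(sapply(th, function(s) {
    pred <- score >= s
    c(fpr = mean(pred[truth == 0]),   # false-positive rate at threshold s
      tpr = mean(pred[truth == 1]))   # true-positive rate at threshold s
  }))
}
set.seed(7)
truth <- rbinom(100, 1, 0.5)
score <- truth + rnorm(100)   # noisy scores, toy stand-in for model output
rc <- roc_points(score, truth)
# plot(rc[, "fpr"], rc[, "tpr"], type = "l")  # one curve per classifier, overlaid
```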

Taking it further

  1. This is an open question. Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does or doesn’t seem reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc). In addition, propose and tackle at least one more interesting question. Creative and thoughtful analyses will be rewarded! This part will be worth up to 20% of your final project grade!

Some possibilities for further exploration are:

Bagging

trn.cl.new <- droplevels(trn.cl)
tst.cl.new <- droplevels(tst.cl)

bag.candidates = randomForest(candidate ~ ., data=trn.cl.new , mtry=6, importance=TRUE) 
bag.candidates
## 
## Call:
##  randomForest(formula = candidate ~ ., data = trn.cl.new, mtry = 6,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 5.99%
## Confusion matrix:
##                 Donald Trump Hillary Clinton class.error
## Donald Trump            2047              40  0.01916627
## Hillary Clinton          107             262  0.28997290
plot(bag.candidates)
legend("top", colnames(bag.candidates$err.rate),col=1:4,cex=0.8,fill=1:4)

yhat.bag = predict(bag.candidates, newdata = tst.cl.new)

# Confusion matrix
bag.err = table(pred = yhat.bag, truth = tst.cl.new$candidate)
test.bag.err = 1 - sum(diag(bag.err))/sum(bag.err)
test.bag.err
## [1] 0.05528455

The test set error rate of the bagged classification tree is 0.0553, an improvement of about 4.2 percentage points over the test error rate of the optimally-pruned single tree (0.0976).

Random Forest

rf.candidates = randomForest(candidate ~ ., data=trn.cl.new , mtry=2, importance=TRUE) 
rf.candidates
## 
## Call:
##  randomForest(formula = candidate ~ ., data = trn.cl.new, mtry = 2,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 6.72%
## Confusion matrix:
##                 Donald Trump Hillary Clinton class.error
## Donald Trump            2058              29  0.01389554
## Hillary Clinton          136             233  0.36856369
plot(rf.candidates)

yhat.rf = predict (rf.candidates, newdata = tst.cl.new)

# Confusion matrix
rf.err = table(pred = yhat.rf, truth = tst.cl.new$candidate)
test.rf.err = 1 - sum(diag(rf.err))/sum(rf.err)
test.rf.err
## [1] 0.06666667

While the overall test error differs little between bagging (0.0553) and the random forest (0.0667), the OOB class error for Hillary Clinton worsens from 0.2900 under bagging to 0.3686 under the random forest, while the class error for Donald Trump improves slightly from 0.0192 to 0.0139.
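The per-class error rates quoted here come directly from the OOB confusion matrices printed by `randomForest()`. A minimal sketch of that calculation, using the bagged model's counts hard-coded from the printed output (rows are the true classes):

```r
# OOB confusion matrix for bag.candidates, copied from the printed output above
oob <- matrix(c(2047, 40,
                 107, 262),
              nrow = 2, byrow = TRUE,
              dimnames = list(true = c("Donald Trump", "Hillary Clinton"),
                              pred = c("Donald Trump", "Hillary Clinton")))

# Per-class error = off-diagonal count / row total
class.err <- 1 - diag(oob) / rowSums(oob)
round(class.err, 4)
#    Donald Trump Hillary Clinton
#          0.0192          0.2900
```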

importance(bag.candidates)
##                 Donald Trump Hillary Clinton MeanDecreaseAccuracy
## CountyPop          10.085981       10.164926            15.030834
## Men                 8.793946       12.202654            15.408343
## White              24.519136       33.553958            33.874999
## Citizen            14.305795        6.781692            16.678137
## Income             11.417194       13.330281            16.954989
## IncomeErr           2.985723       10.320226            10.880983
## IncomePerCap       13.384973       14.520618            18.680711
## IncomePerCapErr     1.693391       14.067144            13.990590
## Poverty            12.815034        7.401784            15.239224
## ChildPoverty        9.635554        6.080630            12.441863
## Professional       15.198920       20.556131            22.745329
## Service            10.490461       11.252728            15.320284
## Office              8.874234        8.311032            12.361477
## Production          7.354873       13.883994            15.783627
## Drive              13.173534       11.859199            17.766385
## Carpool             7.230824        8.939608            12.730542
## Transit            14.481630       46.250004            44.099451
## OtherTransp         2.499826       12.601003            11.226066
## WorkAtHome          9.219281       10.415585            13.180210
## MeanCommute        12.253951        7.760760            14.468004
## Employed           10.538084       16.805031            17.673278
## PrivateWork        11.585718        3.946106            12.411532
## SelfEmployed       10.398478       12.274486            15.434457
## FamilyWork          3.006256        9.368939             8.659078
## Unemployment       14.342196       18.125810            23.229878
## Minority           22.424936       26.930510            30.624428
## weight              0.000000        0.000000             0.000000
##                 MeanDecreaseGini
## CountyPop              13.258746
## Men                    17.904002
## White                  83.643692
## Citizen                16.173338
## Income                 18.470973
## IncomeErr              14.617777
## IncomePerCap           21.890871
## IncomePerCapErr        15.244334
## Poverty                19.305470
## ChildPoverty           11.345730
## Professional           28.916685
## Service                16.523954
## Office                 13.182352
## Production             19.170735
## Drive                  15.573892
## Carpool                10.821920
## Transit               100.658479
## OtherTransp            13.191136
## WorkAtHome              9.929198
## MeanCommute            13.100743
## Employed               17.035050
## PrivateWork            11.010123
## SelfEmployed           15.860071
## FamilyWork              9.671087
## Unemployment           24.936221
## Minority               73.686429
## weight                  0.000000
importance(rf.candidates)
##                 Donald Trump Hillary Clinton MeanDecreaseAccuracy
## CountyPop          12.406061        8.229516            14.800457
## Men                 5.698375       13.072569            12.808408
## White              20.092723       27.145769            26.791478
## Citizen             8.218701        8.030573            11.871149
## Income             11.268533       13.654612            17.941923
## IncomeErr           1.638665       13.154208            12.624747
## IncomePerCap       11.499110       14.469905            18.589353
## IncomePerCapErr    -1.386790       15.304858            12.585600
## Poverty            14.607844        8.490788            17.863356
## ChildPoverty        9.891028        7.919977            13.659752
## Professional       12.370138       19.440270            19.402341
## Service            11.601683       10.023367            16.306188
## Office              8.602865        7.023970            11.139288
## Production          6.272125       14.703312            14.940583
## Drive              15.116874       14.034496            19.725406
## Carpool             7.094749        5.902188             9.146950
## Transit            12.675408       29.001076            28.890978
## OtherTransp         1.489983       12.019211             9.340500
## WorkAtHome          7.471765        8.459778            11.014913
## MeanCommute        10.268248        5.475799            11.412429
## Employed            7.770005       18.108829            16.983414
## PrivateWork         8.633675        3.981321             9.979213
## SelfEmployed        9.837554       12.414358            14.725507
## FamilyWork          1.719984       12.498991             9.892522
## Unemployment       13.081726       15.924146            19.463476
## Minority           20.326633       26.671118            27.367773
## weight              0.000000        0.000000             0.000000
##                 MeanDecreaseGini
## CountyPop           16.314371543
## Men                 20.631733996
## White               59.013466464
## Citizen             19.003874667
## Income              23.176293811
## IncomeErr           20.576809391
## IncomePerCap        25.395491874
## IncomePerCapErr     19.738382672
## Poverty             22.977009066
## ChildPoverty        19.579069731
## Professional        28.035011699
## Service             19.211738725
## Office              15.016555995
## Production          21.981965141
## Drive               21.011376034
## Carpool             12.994329429
## Transit             61.921525148
## OtherTransp         17.275162066
## WorkAtHome          13.736410686
## MeanCommute         14.594803917
## Employed            19.516894794
## PrivateWork         13.432930257
## SelfEmployed        19.555451400
## FamilyWork          14.137843076
## Unemployment        25.368473333
## Minority            58.297769108
## weight               0.003271627
par(mfrow=c(2, 2))
varImpPlot(bag.candidates, sort=T, main="Variable Importance for bag.candidates", n.var=5)

varImpPlot(rf.candidates, sort=T, main="Variable Importance for rf.candidates", n.var=5)